Introduction

This document will serve as a tutorial for using SCISSORS, with the added functionality of detailing exactly how the PBMC3k dataset figures presented in our manuscript were generated. We’ll start from a 10X counts matrix and end with fully annotated cell clusters. In addition to R, in order to run all the code in this document you’ll need a Python 3 installation with openTSNE and all its dependencies installed.

Libraries

library(pals)        # basic colors
library(dplyr)       # tidy data manipulation
library(Seurat)      # single cell infrastructure
library(ggplot2)     # plots
library(SCISSORS)    # our package 
library(paletteer)   # advanced colors
library(reticulate)  # Python interface
library(SeuratData)  # PBMC3k dataset

Data

We load a scRNA-seq dataset provided by 10X Genomics that consists of 2,700 peripheral blood mononuclear cells (PBMCs) from a healthy donor.

pbmc <- LoadData("pbmc3k")

Preprocessing

Here we use PrepareData() to normalize expression & select highly variable genes through sctransform, run PCA & t-SNE. and cluster our cells. We utilize 15 principal components even though the Satija Lab vignette used 10, as we use sctransform normalization, which does a better job of retaining biological heterogeneity through normalization than standard log-normalization.

pbmc <- PrepareData(pbmc, 
                    n.variable.genes = 4000, 
                    n.PC = 15, 
                    which.dim.reduc = "tsne",
                    initial.resolution = .4, 
                    random.seed = 629)
## [1] "Normalizing counts using SCTransform"
## [1] "Running t-SNE on 15 principal components with perplexity = 30"
## [1] "Clustering cells in PCA space using k ~ 52 & resolution = 0.4"
## [1] "Found 6 unique clusters"
p0 <- DimPlot(pbmc, cols = cols25()) + 
      labs(x = "t-SNE 1", y = "t-SNE 2") + 
      theme_yehlab() + 
      theme(legend.position = "bottom", 
            legend.justification = "center", 
            legend.direction = "horizontal") + 
      guides(color = guide_legend(nrow = 1, override.aes = list(size = 3)))
p0

Fit-SNE

We’ll also run the Fast Fourier Transform-accelerated version of t-SNE as implemented in the Python library openTSNE. First we’ll need to get our PC matrix into a form accessible by our Python interpreter.

pc_mat <- Embeddings(pbmc, reduction = "pca")
# import libraries
import numpy as np
import pandas as pd
from openTSNE import TSNEEmbedding
from openTSNE import initialization
from openTSNE.affinity import Multiscale
# import PC matrix
pc_mat = np.array(r.pc_mat)
# run Fit-SNE w/ multiscale kernel
affin_multi = Multiscale(pc_mat, perplexities=[30, 150], metric='cosine', random_state=629)
init = initialization.pca(pc_mat, random_state=629)
tsne_obj = TSNEEmbedding(init, affin_multi, negative_gradient_method='fft')
embed = tsne_obj.optimize(n_iter=300, exaggeration=12, momentum=0.6)
embed_multi = embed.optimize(n_iter=600, exaggeration=1, momentum=0.8)

Now we pull the results back into R, and save them in our Seurat object. We save the original embedding (made using the default Barnes-Hut t-SNE implementation) in pbmc@reduction$bh_tsne - we need to do this in order to make the Fit-SNE embedding the default embedding that will be retrieved in calls to functions such as DimPlot() or FeaturePlot().

embed <- as.matrix(py$embed_multi)
rownames(embed) <- colnames(pbmc)
colnames(embed) <- c("Fit-SNE_1", "Fit-SNE_2")
pbmc@reductions$bh_tsne <- pbmc@reductions$tsne
pbmc@reductions$tsne <- CreateDimReducObject(embeddings = embed, 
                                             key = "FitSNE_", 
                                             assay = "SCT", 
                                             global = TRUE)

Visualizing the results shows pretty much the same global structure as with the default t-SNE implementation, albeit rotated a bit, but I like that Fit-SNE’s clusters are a bit denser, so we’ll use Fit-SNE going forward.

p1 <- DimPlot(pbmc, cols = cols25()) + 
      labs(x = "Fit-SNE 1", y = "Fit-SNE 2") + 
      theme_yehlab() + 
      theme(legend.position = "bottom", 
            legend.justification = "center", 
            legend.direction = "horizontal") + 
      guides(color = guide_legend(nrow = 1, override.aes = list(size = 3)))
p1

Reclustering

Here we’ll examine clusters 0, 1, and 2. Cluster 0 seems large enough, and has enough variability on the X-axis of the t-SNE embedding to appear as though it might harbor subgroups. Clusters 1 and 2 have small but visible subclusters.

reclust_res <- ReclusterCells(pbmc, 
                              which.clust = list(0, 1, 2), 
                              n.variable.genes = 4000,
                              n.PC = 15, 
                              k.vals = c(5, 10, 15, 20, 25), 
                              resolution.vals = c(.1, .2, .3, .4), 
                              which.dim.reduc = "tsne", 
                              redo.embedding = TRUE, 
                              random.seed = 629)
## [1] "Reclustering cells in cluster 0 using k = 5 & resolution = 0.2, which achieved silhouette score: 0.394"
## [1] "Reclustering cells in cluster 1 using k = 33 & resolution = 0.2, which achieved silhouette score: 0.371"
## [1] "Reclustering cells in cluster 2 using k = 5 & resolution = 0.1, which achieved silhouette score: 0.6"

Now we’ll run Fit-SNE on each of the reclustered objects for consistencies sake. First we’ll need to isolate the PC matrices and send them to Python.

pc_clust0 <- Embeddings(reclust_res[[1]], reduction = "pca")
pc_clust1 <- Embeddings(reclust_res[[2]], reduction = "pca")
pc_clust2 <- Embeddings(reclust_res[[3]], reduction = "pca")

Running Fit-SNE three times could probably be cleaned up and done in a for loop, but this was easiest to write out in a short time. We use different perplexity sets for each cluster, as they’re all composed of differing numbers of cells.

# import PC matrices
pc_clust0 = np.array(r.pc_clust0)
pc_clust1 = np.array(r.pc_clust1)
pc_clust2 = np.array(r.pc_clust2)

# run Fit-SNE w/ multiscale kernel - cluster 0
affin_multi_clust0 = Multiscale(pc_clust0, perplexities=[15, 50], metric='cosine', random_state=629)
init_clust0 = initialization.pca(pc_clust0, random_state=629)
tsne_obj_clust0 = TSNEEmbedding(init_clust0, affin_multi_clust0, negative_gradient_method='fft')
embed_clust0 = tsne_obj_clust0.optimize(n_iter=300, exaggeration=12, momentum=0.7)
embed_multi_clust0 = embed_clust0.optimize(n_iter=850, exaggeration=1, momentum=0.8)

# run Fit-SNE w/ multiscale kernel - cluster 1
affin_multi_clust1 = Multiscale(pc_clust1, perplexities=[15, 30], metric='cosine', random_state=629)
init_clust1 = initialization.pca(pc_clust1, random_state=629)
tsne_obj_clust1 = TSNEEmbedding(init_clust1, affin_multi_clust1, negative_gradient_method='fft')
embed_clust1 = tsne_obj_clust1.optimize(n_iter=300, exaggeration=12, momentum=0.6)
embed_multi_clust1 = embed_clust1.optimize(n_iter=850, exaggeration=1, momentum=0.8)

# run Fit-SNE w/ multiscale kernel - cluster 2
affin_multi_clust2 = Multiscale(pc_clust2, perplexities=[10, 30], metric='cosine', random_state=629)
init_clust2 = initialization.pca(pc_clust2, random_state=629)
tsne_obj_clust2 = TSNEEmbedding(init_clust2, affin_multi_clust2, negative_gradient_method='fft')
embed_clust2 = tsne_obj_clust2.optimize(n_iter=300, exaggeration=12, momentum=0.6)
embed_multi_clust2 = embed_clust2.optimize(n_iter=750, exaggeration=1, momentum=0.8)

Now we bring the results back in to R.

embed0 <- as.matrix(py$embed_multi_clust0)
rownames(embed0) <- colnames(reclust_res[[1]])
colnames(embed0) <- c("Fit-SNE_1", "Fit-SNE_2")
reclust_res[[1]]@reductions$bh_tsne <- reclust_res[[1]]@reductions$tsne
reclust_res[[1]]@reductions$fitsne <- CreateDimReducObject(embeddings = embed0, 
                                                           key = "FitSNE_",
                                                           assay = "SCT", 
                                                           global = TRUE)
embed1 <- as.matrix(py$embed_multi_clust1)
rownames(embed1) <- colnames(reclust_res[[2]])
colnames(embed1) <- c("Fit-SNE_1", "Fit-SNE_2")
reclust_res[[2]]@reductions$bh_tsne <- reclust_res[[2]]@reductions$tsne
reclust_res[[2]]@reductions$fitsne <- CreateDimReducObject(embeddings = embed1,
                                                           key = "FitSNE_",
                                                           assay = "SCT",
                                                           global = TRUE)
embed2 <- as.matrix(py$embed_multi_clust2)
rownames(embed2) <- colnames(reclust_res[[3]])
colnames(embed2) <- c("Fit-SNE_1", "Fit-SNE_2")
reclust_res[[3]]@reductions$bh_tsne <- reclust_res[[3]]@reductions$tsne
reclust_res[[3]]@reductions$fitsne <- CreateDimReducObject(embeddings = embed2,
                                                           key = "FitSNE_",
                                                           assay = "SCT",
                                                           global = TRUE)

Here’s what our reclusterings look like. There’s clear visual separation between the main clusters and the subgroups we’ve discovered using SCISSORS.

p2 <- DimPlot(reclust_res[[1]], cols = paletteer_d("ggsci::default_locuszoom")) + 
      labs(x = "Fit-SNE 1", y = "Fit-SNE 2") + 
      theme_yehlab() + 
      theme(legend.position = "bottom", 
            legend.justification = "center", 
            legend.direction = "horizontal") + 
      guides(color = guide_legend(nrow = 1, override.aes = list(size = 3)))
p3 <- DimPlot(reclust_res[[2]], cols = paletteer_d("ggsci::default_locuszoom")) + 
      labs(x = "Fit-SNE 1", y = "Fit-SNE 2") + 
      theme_yehlab() + 
      theme(legend.position = "bottom", 
            legend.justification = "center", 
            legend.direction = "horizontal") + 
      guides(color = guide_legend(nrow = 1, override.aes = list(size = 3)))
p4 <- DimPlot(reclust_res[[3]], cols = paletteer_d("ggsci::default_locuszoom")) + 
      labs(x = "Fit-SNE 1", y = "Fit-SNE 2") + 
      theme_yehlab() + 
      theme(legend.position = "bottom", 
            legend.justification = "center", 
            legend.direction = "horizontal") + 
      guides(color = guide_legend(nrow = 1, override.aes = list(size = 3)))
p2

p3

p4

Next, we’ll reintegrate our new clusters into our original Seurat object - this requires some finagling as Seurat is a bit weird with how it stores cell-level metadata. Since we had six clusters originally, and we discovered six new subclusters, we’ll end up with twelve total clusters.

clust_df <- data.frame(CellID = colnames(pbmc), 
                       ClustID = as.numeric(pbmc@meta.data$seurat_clusters) - 1, 
                       stringsAsFactors = FALSE)
clust_6 <- rownames(reclust_res[[1]]@meta.data[reclust_res[[1]]@meta.data$seurat_clusters == 1, ])
clust_7 <- rownames(reclust_res[[1]]@meta.data[reclust_res[[1]]@meta.data$seurat_clusters == 2, ])
clust_8 <- rownames(reclust_res[[1]]@meta.data[reclust_res[[1]]@meta.data$seurat_clusters == 3, ])
clust_9 <- rownames(reclust_res[[2]]@meta.data[reclust_res[[2]]@meta.data$seurat_clusters == 1, ])
clust_10 <- rownames(reclust_res[[3]]@meta.data[reclust_res[[3]]@meta.data$seurat_clusters == 1, ])
clust_11 <- rownames(reclust_res[[3]]@meta.data[reclust_res[[3]]@meta.data$seurat_clusters == 2, ])
label <- case_when(clust_df$CellID %in% clust_6 ~ 6, 
                   clust_df$CellID %in% clust_7 ~ 7, 
                   clust_df$CellID %in% clust_8 ~ 8, 
                   clust_df$CellID %in% clust_9 ~ 9, 
                   clust_df$CellID %in% clust_10 ~ 10, 
                   clust_df$CellID %in% clust_11 ~ 11, 
                   TRUE ~ clust_df$ClustID)
pbmc <- AddMetaData(pbmc, 
                    col.name = "seurat_clusters", 
                    metadata = as.factor(label))
Idents(pbmc) <- "seurat_clusters"
p5 <- DimPlot(pbmc, cols = cols25()) + 
      labs(x = "Fit-SNE 1", y = "Fit-SNE 2") + 
      theme_yehlab() + 
      theme(legend.position = "bottom", 
            legend.justification = "center", 
            legend.direction = "horizontal") + 
      guides(color = guide_legend(nrow = 1, override.aes = list(size = 3)))
p5

Identify Cell Types

Now that we’ve determined our subpopulations, we can assign cell types to each cluster using the marker genes provided in the Satija Lab’s PBMC3k vignette, as well as other canonical markers.

CD4+ T Cells

We can quickly identify cluster 0 as the memory CD4+ T cells, and cluster 6 as the naive CD4+ population.

p6 <- FeaturePlot(pbmc, 
                  features = "IL7R", 
                  cols = c("lightgrey", "firebrick3")) + 
      NoLegend() + 
      theme_yehlab() + 
      theme(axis.title = element_blank())
p7 <- FeaturePlot(pbmc, 
                  features = "CCR7", 
                  cols = c("lightgrey", "firebrick3")) + 
      NoLegend() + 
      theme_yehlab()  + 
      theme(axis.title = element_blank())
p8 <- FeaturePlot(pbmc, 
                  features = "S100A4", 
                  cols = c("lightgrey", "firebrick3")) + 
      NoLegend() + 
      theme_yehlab() + 
      theme(axis.title = element_blank())
p9 <- (p6 | p7 | p8) / p5
p9

Cluster 7 is only subtly separated from the CD4+ T cell clusters. We’ll use differential expression testing to determine if the cells in cluster 7 are a spurious cluster or a real T cell subtype. After running a Wilcox test we can see that several of the differentially expressed are associated with the interferon family of cytokines and with anti-viral immune responses.

clust7_markers <- FindAllMarkers(pbmc, 
                                 assay = "SCT", 
                                 logfc.threshold = .5, 
                                 only.pos = TRUE, 
                                 test.use = "wilcox", 
                                 verbose = FALSE, 
                                 random.seed = 629) %>% 
                  subset(cluster  == 7 & p_val_adj < .05) 
clust7_markers
p_val avg_logFC pct.1 pct.2 p_val_adj cluster gene
IFIT1 0e+00 0.8411350 0.68 0.073 0.0000000 7 IFIT1
IFIT3 0e+00 0.8418490 0.72 0.099 0.0000000 7 IFIT3
RSAD2 0e+00 0.5548527 0.48 0.051 0.0000000 7 RSAD2
MX1 0e+00 0.9419429 0.84 0.197 0.0000000 7 MX1
ISG152 0e+00 1.6141024 1.00 0.424 0.0000000 7 ISG15
IFI61 0e+00 1.4006506 0.96 0.377 0.0000000 7 IFI6
LGALS3BP 0e+00 0.5142930 0.44 0.059 0.0000000 7 LGALS3BP
STAT1 0e+00 0.6117354 0.64 0.127 0.0000000 7 STAT1
OASL 0e+00 0.5655251 0.36 0.043 0.0000000 7 OASL
OAS1 0e+00 0.5748545 0.72 0.183 0.0000000 7 OAS1
ISG20 0e+00 0.9174060 0.96 0.433 0.0000000 7 ISG20
MT2A1 0e+00 1.0596818 0.80 0.295 0.0000001 7 MT2A
IFI44L 0e+00 1.0540589 0.76 0.246 0.0000009 7 IFI44L
IFITM11 0e+00 0.7859682 0.92 0.450 0.0000043 7 IFITM1
LY6E1 0e+00 0.7496914 0.92 0.658 0.0002478 7 LY6E
TNFSF101 1e-07 0.7812960 0.60 0.202 0.0008592 7 TNFSF10
MYL12A1 3e-07 0.5222692 1.00 0.881 0.0040999 7 MYL12A

Upon visual inspection of the top three marker genes (as determined by adjusted p-value), we can see that they do an equally good job of distinguishing the small cluster from the sample as a whole as they do at separating it from the memory CD4+ T cells. Due to their anti-viral characteristics , we’ll define this group as being composed of Type 1 helper T cells (Th1).

p10 <- FeaturePlot(pbmc, 
                   features = "IFIT1", 
                   cols = c("lightgrey", "firebrick3")) + 
       NoLegend() + 
       theme_yehlab() + 
       theme(axis.title = element_blank())
p11 <- FeaturePlot(pbmc, 
                  features = "IFIT3", 
                  cols = c("lightgrey", "firebrick3")) + 
       NoLegend() + 
       theme_yehlab() + 
       theme(axis.title = element_blank())
p12 <- FeaturePlot(pbmc, 
                  features = "IFI6", 
                  cols = c("lightgrey", "firebrick3")) + 
       NoLegend() + 
       theme_yehlab() + 
       theme(axis.title = element_blank())
p13 <- (p10 | p11 | p12) / p5
p13

CD14+ Monocytes

Cluster 1 clearly houses our CD14+ monocytes.

p14 <- FeaturePlot(pbmc, 
                   features = "CD14", 
                  cols = c("lightgrey", "firebrick3")) + 
       NoLegend() + 
       theme_yehlab() + 
       theme(axis.title = element_blank())
p15 <- FeaturePlot(pbmc, 
                   features = "LYZ", 
                  cols = c("lightgrey", "firebrick3")) + 
       NoLegend() + 
       theme_yehlab() + 
       theme(axis.title = element_blank())
p16 <- (p14 | p15) / p5
p16

FCGR3A+ Monocytes

It follows that the FCGR3A+ monocytes are positioned next to the CD14+ monocytes in cluster 4.

p17 <- FeaturePlot(pbmc, 
                   features = "FCGR3A", 
                  cols = c("lightgrey", "firebrick3")) + 
       NoLegend() + 
       theme_yehlab() + 
       theme(axis.title = element_blank())
p18 <- FeaturePlot(pbmc, 
                   features = "MS4A7", 
                  cols = c("lightgrey", "firebrick3")) + 
       NoLegend() + 
       theme_yehlab() + 
       theme(axis.title = element_blank())
p19 <- (p17 | p18) / p5
p19

B Cells

Expression of MS4A1 allows us to isolate the B cells as belonging to cluster 3.

p20 <- FeaturePlot(pbmc, 
                   features = "MS4A1", 
                  cols = c("lightgrey", "firebrick3")) + 
       NoLegend() + 
       theme_yehlab() + 
       theme(axis.title = element_blank())
p21 <- p20 / p5
p21

CD8+ T Cells

The canonical marker CD8A swiftly identifies our CD8+ T cells in cluster 2.

p22 <- FeaturePlot(pbmc, 
                   features = "CD8A", 
                  cols = c("lightgrey", "firebrick3")) + 
       NoLegend() + 
       theme_yehlab() + 
       theme(axis.title = element_blank())
p23 <- p22 / p5
p23

Natural Killer Cells

We can use NKG7 and GNLY to isolate the NK cells in cluster 5.

p24 <- FeaturePlot(pbmc, 
                   features = "NKG7", 
                  cols = c("lightgrey", "firebrick3")) + 
       NoLegend() + 
       theme_yehlab() + 
       theme(axis.title = element_blank())
p25 <- FeaturePlot(pbmc, 
                   features = "GNLY", 
                  cols = c("lightgrey", "firebrick3")) + 
       NoLegend() + 
       theme_yehlab() + 
       theme(axis.title = element_blank())
p26 <- (p24 | p25) / p5
p26

Dendritic Cells

The dendritic cell group is defined by expression of FCER1A and CST3 in cluster 9.

p27 <- FeaturePlot(pbmc, 
                   features = "FCER1A", 
                  cols = c("lightgrey", "firebrick3")) + 
       NoLegend() + 
       theme_yehlab() + 
       theme(axis.title = element_blank())
p28 <- FeaturePlot(pbmc, 
                   features = "CST3", 
                  cols = c("lightgrey", "firebrick3")) + 
       NoLegend() + 
       theme_yehlab() + 
       theme(axis.title = element_blank())
p29 <- (p27 | p28) / p5
p29

Platelets

The tiny platelet population of 11 cells can be identified by its expression of PPBP in cluster 10.

p30 <- FeaturePlot(pbmc, 
                   features = "PPBP", 
                  cols = c("lightgrey", "firebrick3")) + 
       NoLegend() + 
       theme_yehlab() + 
       theme(axis.title = element_blank())
p31 <- p30 / p5
p31

Plasmacytoid Dendritic Cells

A very small population of plasmacytoid dendritic cells - only 4 cells - was teased out by SCISSORS and can be annotated using expression of MZB1.

p32 <- FeaturePlot(pbmc, 
                   features = "MZB1", 
                   cols = c("lightgrey", "firebrick3")) + 
       NoLegend() + 
       theme_yehlab() + 
       theme(axis.title = element_blank())
p33 <- p32 / p5
p33

Hematocytoblasts

Lastly, we need to assign an identity to the unknown cluster 8 - another rare cell population, also composed of only 4 cells. We’ll use differential expression to compare it as we did earlier with the Th1 cells. The marker CYTL1 is over-expressed in hematopoetic stem cells (hematocytoblasts).

clust8_markers <- FindAllMarkers(pbmc, 
                                 test.use = "wilcox", 
                                 only.pos = TRUE, 
                                 logfc.threshold = .75, 
                                 assay = "SCT", 
                                 verbose = FALSE, 
                                 random.seed = 629) %>% 
                  subset(cluster == 8 & p_val_adj < .05)
clust8_markers
p_val avg_logFC pct.1 pct.2 p_val_adj cluster gene
EGFL7 0e+00 0.8075975 0.75 0.003 0.0000000 8 EGFL7
CYTL1 0e+00 0.8057508 0.75 0.004 0.0000000 8 CYTL1
LZTS2 0e+00 2.2731955 0.50 0.004 0.0000000 8 LZTS2
RAB13 0e+00 0.7947415 0.75 0.015 0.0000000 8 RAB13
FCER1A 0e+00 1.1122867 0.75 0.019 0.0000000 8 FCER1A
SLC39A8 0e+00 0.7745096 0.75 0.036 0.0000000 8 SLC39A8
USP30 0e+00 1.8662538 0.25 0.006 0.0000020 8 USP30
IL1B 1e-07 0.9726204 0.75 0.075 0.0016298 8 IL1B
p34 <- FeaturePlot(pbmc, 
                   features = "CYTL1", 
                   cols = c("lightgrey", "firebrick3")) + 
       NoLegend() + 
       theme_yehlab() + 
       theme(axis.title = element_blank())
p35 <- p34 / p5
p35

Final Figure

Finally, we’ll add cell labels to our original Seurat object and plot the results.

pbmc@meta.data$label <- case_when(pbmc@meta.data$seurat_clusters == 0 ~ "Memory CD4+ T", 
                                  pbmc@meta.data$seurat_clusters == 1 ~ "CD14+ Monocyte", 
                                  pbmc@meta.data$seurat_clusters == 2 ~ "CD8+ T", 
                                  pbmc@meta.data$seurat_clusters == 3 ~ "B", 
                                  pbmc@meta.data$seurat_clusters == 4 ~ "FCGR3A+ Monocyte", 
                                  pbmc@meta.data$seurat_clusters == 5 ~ "NK", 
                                  pbmc@meta.data$seurat_clusters == 6 ~ "Naive CD4+ T", 
                                  pbmc@meta.data$seurat_clusters == 7 ~ "Th1", 
                                  pbmc@meta.data$seurat_clusters == 8 ~ "Hematocytoblast", 
                                  pbmc@meta.data$seurat_clusters == 9 ~ "DC", 
                                  pbmc@meta.data$seurat_clusters == 10 ~ "Platelet", 
                                  pbmc@meta.data$seurat_clusters == 11 ~ "Plasmacytoid DC")

We redefine the final color palette to be used in order to have a better-looking figure. Note: if you’re manually defining colors and want Seurat to assign a certain color to a certain cell type or cluster number, you need to name the vector of colors you provide.

final_pal <- c("paleturquoise4", "goldenrod", "steelblue1", "lightseagreen", 
               "sienna4", "mediumblue", "coral3", "magenta", "limegreen", 
               "forestgreen", "darkorchid3", "orange")
names(final_pal) <- c("Memory CD4+ T", "CD14+ Monocyte", "CD8+ T", "B", 
                      "FCGR3A+ Monocyte", "NK", "Naive CD4+ T", "Th1", 
                      "Hematocytoblast", "DC", "Platelet", "Plasmacytoid DC")
p36 <- DimPlot(pbmc, cols = final_pal, group.by = "label") + 
       theme_yehlab() + 
       theme(legend.position = "bottom", 
             legend.justification = "center", 
             legend.text = element_text(size = 11)) + 
       guides(color = guide_legend(nrow = 3, override.aes = list(size = 3))) + 
       labs(x = "Fit-SNE 1", y = "Fit-SNE 2", title = NULL)
p36

Conclusions

SCISSORS revealed tiny platelet and plasmacytoid dendritic cell clusters that were initially grouped with the CD8+ T cells, and it helped us to separate the dendritic cells from the larger CD14+ monocyte cluster. It also split up the naive and memory CD4+ T cells, and showed us a tiny Th1 cell subset that was not initially visible. The plasmacytoid DCs and Th1 cells were not annotated in the original Satija Lab PBMC3k vignette.

We used the PBMC3k dataset because of 1) its immediate availability to anyone wishing to replicate our results and 2) the validity of its annotations, which allowed us to be confident in the results from SCISSORS, which was able to carve out rare cell groups from larger, broader cell types. In this case, the dendritic cell cluster was composed of 31 cells, the platelet cluster of 11 cells, the Th1 cluster of 25 cells, and the minuscule plasmacytoid DC cluster of just 4 cells. Respectively, these cell types made up 1.15%, 0.41%, 0.93%, and 0.15% of the entire sample. We thus believe we can confidently say that SCISSORS has been shown to accurately and swiftly identify rare cell types by considering the variance in gene expression within clusters and judging iterative reclustering using silhouette scores, rather than attempting to identify rare cell populations at the level of the entire dataset.

Save Data & Figures

This part isn’t really worth reading; it’s just here to prove that all the figures were actually dynamically generated and saved upon knitting this document.

pdrive_dir <- "/Volumes/labs/Home/Jen Jen Yeh Lab/Jack/Reclustering Project/Seurat Objects/pbmc3k_202101.Rds"
saveRDS(pbmc, file = pdrive_dir)

We’ll create a quick convenience function to help us save the figures.

saveSCISSORS <- function(plot = NULL, 
                         name = NULL, 
                         border = TRUE, 
                         pub.ready = FALSE) {
  if (is.null(plot) | is.null(name)) stop("You forgot some arguments.")
  if (pub.ready) {
    dir <- "~/Desktop/R/SCISSORS/vignettes/figures_pub/PBMC"
    if (!border) {
      plot <- plot + 
              theme(panel.border = element_blank(), 
                    axis.title = element_blank(), 
                    legend.position = "none")
    } else {
      plot <- plot + 
              theme(axis.title = element_blank(), 
                    legend.position = "none")
    }
    ggsave(filename = paste0(name, ".pdf"), 
           device = "pdf", 
           units = "in",
           path = dir, 
           height = 8, 
           width = 8) 
  } else {
    dir <- "~/Desktop/R/SCISSORS/vignettes/figures_supp/PBMC"
    ggsave(filename = paste0(name, ".pdf"), 
           device = "pdf", 
           units = "in",
           path = dir, 
           height = 8, 
           width = 8) 
  }
}

Lastly, we’ll save the figures under ./vignettes/figures/.

saveSCISSORS(plot = p0, name = "Seurat_Clusters", 
             pub.ready = TRUE, border = FALSE)
saveSCISSORS(plot = p1, name = "Seurat_Clusters_FitSNE", 
             pub.ready = TRUE, border = FALSE)
saveSCISSORS(plot = p2, name = "Clust0_Reclust", 
             pub.ready = TRUE, border = FALSE)
saveSCISSORS(plot = p3, name = "Clust1_Reclust", 
             pub.ready = TRUE, border = FALSE)
saveSCISSORS(plot = p4, name = "Clust2_Reclust", 
             pub.ready = TRUE, border = FALSE)
saveSCISSORS(plot = p5, name = "SCISSORS_Clusters", 
             pub.ready = TRUE, border = FALSE)
saveSCISSORS(plot = p6, name = "CD4T_IL7R")
saveSCISSORS(plot = p7, name = "CD4T_CCR7")
saveSCISSORS(plot = p8, name = "CD4T_S100A4")
saveSCISSORS(plot = p10, name = "TH1_IFIT1")
saveSCISSORS(plot = p11, name = "TH1_IFIT3")
saveSCISSORS(plot = p12, name = "TH1_IFI6")
saveSCISSORS(plot = p14, name = "Monocyte_CD14")
saveSCISSORS(plot = p15, name = "Monocyte_LYZ")
saveSCISSORS(plot = p17, name = "FCGR3A_Monocyte_FCGR3A")
saveSCISSORS(plot = p18, name = "FCGR3A_Monocyte_MS4A7")
saveSCISSORS(plot = p20, name = "B_MS4A1")
saveSCISSORS(plot = p22, name = "CD8T_CD8A")
saveSCISSORS(plot = p24, name = "NK_NKG7")
saveSCISSORS(plot = p25, name = "NK_GNLY")
saveSCISSORS(plot = p27, name = "DC_FCER1A")
saveSCISSORS(plot = p28, name = "DC_CST3")
saveSCISSORS(plot = p30, name = "Platelet_PPBP")
saveSCISSORS(plot = p32, name = "Plasmacytoid_DC_MZB1")
saveSCISSORS(plot = p34, name = "Hematocytoblasts_CYTL1")
saveSCISSORS(plot = p36, name = "SCISSORS_Final_Annotations", 
             pub.ready = TRUE, border = FALSE)

And of course:

sessionInfo()
## R version 4.0.3 (2020-10-10)
## Platform: x86_64-apple-darwin17.0 (64-bit)
## Running under: macOS Catalina 10.15.7
## 
## Matrix products: default
## BLAS:   /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRblas.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRlapack.dylib
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## attached base packages:
## [1] parallel  stats4    stats     graphics  grDevices utils     datasets 
## [8] methods   base     
## 
## other attached packages:
##  [1] pbmc3k.SeuratData_3.1.4     SeuratData_0.2.1           
##  [3] reticulate_1.18             paletteer_1.3.0            
##  [5] SCISSORS_0.0.2.0            SingleCellExperiment_1.10.1
##  [7] SummarizedExperiment_1.18.2 DelayedArray_0.14.1        
##  [9] matrixStats_0.57.0          Biobase_2.48.0             
## [11] GenomicRanges_1.40.0        GenomeInfoDb_1.24.2        
## [13] IRanges_2.22.2              S4Vectors_0.26.1           
## [15] BiocGenerics_0.34.0         data.table_1.13.6          
## [17] cluster_2.1.0               biomaRt_2.44.1             
## [19] ggplot2_3.3.3               Seurat_3.2.3               
## [21] dplyr_1.0.2                 pals_1.6                   
## 
## loaded via a namespace (and not attached):
##   [1] BiocFileCache_1.12.1   plyr_1.8.6             igraph_1.2.6          
##   [4] lazyeval_0.2.2         splines_4.0.3          listenv_0.8.0         
##   [7] scattermore_0.7        digest_0.6.27          htmltools_0.5.0       
##  [10] fansi_0.4.1            magrittr_2.0.1         memoise_1.1.0         
##  [13] tensor_1.5             ROCR_1.0-11            limma_3.44.3          
##  [16] globals_0.14.0         askpass_1.1            prettyunits_1.1.1     
##  [19] colorspace_2.0-0       blob_1.2.1             rappdirs_0.3.1        
##  [22] ggrepel_0.9.0          xfun_0.19              prismatic_1.0.0       
##  [25] crayon_1.3.4           RCurl_1.98-1.2         jsonlite_1.7.2        
##  [28] spatstat_1.64-1        spatstat.data_1.7-0    survival_3.2-7        
##  [31] zoo_1.8-8              glue_1.4.2             polyclip_1.10-0       
##  [34] gtable_0.3.0           zlibbioc_1.34.0        XVector_0.28.0        
##  [37] leiden_0.3.6           future.apply_1.7.0     maps_3.3.0            
##  [40] abind_1.4-5            scales_1.1.1           DBI_1.1.0             
##  [43] miniUI_0.1.1.1         Rcpp_1.0.5             viridisLite_0.3.0     
##  [46] xtable_1.8-4           progress_1.2.2         bit_4.0.4             
##  [49] rsvd_1.0.3             mapproj_1.2.7          htmlwidgets_1.5.3     
##  [52] httr_1.4.2             RColorBrewer_1.1-2     ellipsis_0.3.1        
##  [55] ica_1.0-2              farver_2.0.3           pkgconfig_2.0.3       
##  [58] XML_3.99-0.5           uwot_0.1.10            dbplyr_2.0.0          
##  [61] deldir_0.2-3           labeling_0.4.2         tidyselect_1.1.0      
##  [64] rlang_0.4.10           reshape2_1.4.4         later_1.1.0.1         
##  [67] AnnotationDbi_1.50.3   munsell_0.5.0          tools_4.0.3           
##  [70] cli_2.2.0              generics_0.1.0         RSQLite_2.2.1         
##  [73] ggridges_0.5.2         evaluate_0.14          stringr_1.4.0         
##  [76] fastmap_1.0.1          yaml_2.2.1             goftest_1.2-2         
##  [79] rematch2_2.1.2         knitr_1.30             bit64_4.0.5           
##  [82] fitdistrplus_1.1-3     purrr_0.3.4            RANN_2.6.1            
##  [85] pbapply_1.4-3          future_1.21.0          nlme_3.1-151          
##  [88] mime_0.9               rstudioapi_0.13        compiler_4.0.3        
##  [91] plotly_4.9.2.2         curl_4.3               png_0.1-7             
##  [94] spatstat.utils_1.17-0  tibble_3.0.4           stringi_1.5.3         
##  [97] highr_0.8              lattice_0.20-41        Matrix_1.3-0          
## [100] vctrs_0.3.6            pillar_1.4.7           lifecycle_0.2.0       
## [103] lmtest_0.9-38          RcppAnnoy_0.0.18       cowplot_1.1.1         
## [106] bitops_1.0-6           irlba_2.3.3            httpuv_1.5.4          
## [109] patchwork_1.1.1        R6_2.5.0               promises_1.1.1        
## [112] KernSmooth_2.23-18     gridExtra_2.3          parallelly_1.23.0     
## [115] codetools_0.2-18       dichromat_2.0-0        MASS_7.3-53           
## [118] assertthat_0.2.1       openssl_1.4.3          withr_2.3.0           
## [121] sctransform_0.3.2      GenomeInfoDbData_1.2.3 mgcv_1.8-33           
## [124] hms_0.5.3              grid_4.0.3             rpart_4.1-15          
## [127] tidyr_1.1.2            rmarkdown_2.6          Rtsne_0.15            
## [130] shiny_1.5.0
---
title: "PBMC Analysis using SCISSORS"
subtitle: "Jack Leary"
author: 
  - "University of North Carolina at Chapel Hill - Lineberger Comprehensive Cancer Center"
  - "University of Florida - Department of Biostatistics"
date: "`r Sys.Date()`"
output:
  html_document:
    theme: paper
    highlight: tango
    df_print: kable
    toc: true
    toc_float: true
    code_folding: show
    code_download: true
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE, 
                      message = FALSE, 
                      warning = FALSE, 
                      fig.align = "center")
reticulate::use_virtualenv("~/Desktop/Python/science/venv/", required = TRUE)
```

# Introduction

This document will serve as a tutorial for using `SCISSORS`, with the added functionality of detailing exactly how the PBMC3k dataset figures presented in our manuscript were generated. We'll start from a 10X counts matrix and end with fully annotated cell clusters. In addition to R, in order to run all the code in this document you'll need a Python 3 installation with `openTSNE` and all its dependencies installed.

# Libraries

```{r}
library(pals)        # basic colors
library(dplyr)       # tidy data manipulation
library(Seurat)      # single cell infrastructure
library(ggplot2)     # plots
library(SCISSORS)    # our package 
library(paletteer)   # advanced colors
library(reticulate)  # Python interface
library(SeuratData)  # PBMC3k dataset
```

# Data

We load a scRNA-seq dataset provided by 10X Genomics that consists of 2,700 peripheral blood mononuclear cells (PBMCs) from a healthy donor.

```{r}
pbmc <- LoadData("pbmc3k")
```

# Preprocessing

Here we use `PrepareData()` to normalize expression & select highly variable genes through `sctransform`, run PCA & t-SNE. and cluster our cells. We utilize 15 principal components even though [the Satija Lab vignette]((https://satijalab.org/seurat/v3.2/pbmc3k_tutorial.html)) used 10, as we use `sctransform` normalization, which does a better job of retaining biological heterogeneity through normalization than standard log-normalization.

```{r, warning=FALSE, message=FALSE}
pbmc <- PrepareData(pbmc, 
                    n.variable.genes = 4000, 
                    n.PC = 15, 
                    which.dim.reduc = "tsne",
                    initial.resolution = .4, 
                    random.seed = 629)
p0 <- DimPlot(pbmc, cols = cols25()) + 
      labs(x = "t-SNE 1", y = "t-SNE 2") + 
      theme_yehlab() + 
      theme(legend.position = "bottom", 
            legend.justification = "center", 
            legend.direction = "horizontal") + 
      guides(color = guide_legend(nrow = 1, override.aes = list(size = 3)))
p0
```

## Fit-SNE

We'll also run the Fast Fourier Transform-accelerated version of t-SNE as implemented in the Python library `openTSNE`. First we'll need to get our PC matrix into a form accessible by our Python interpreter.

```{r}
pc_mat <- Embeddings(pbmc, reduction = "pca")
```

```{python}
# import libraries
import numpy as np
import pandas as pd
from openTSNE import TSNEEmbedding
from openTSNE import initialization
from openTSNE.affinity import Multiscale
# import PC matrix
pc_mat = np.array(r.pc_mat)
# run Fit-SNE w/ multiscale kernel
affin_multi = Multiscale(pc_mat, perplexities=[30, 150], metric='cosine', random_state=629)
init = initialization.pca(pc_mat, random_state=629)
tsne_obj = TSNEEmbedding(init, affin_multi, negative_gradient_method='fft')
embed = tsne_obj.optimize(n_iter=300, exaggeration=12, momentum=0.6)
embed_multi = embed.optimize(n_iter=600, exaggeration=1, momentum=0.8)
```

Now we pull the results back into R, and save them in our `Seurat` object. We save the original embedding (made using the default Barnes-Hut t-SNE implementation) in `pbmc@reduction$bh_tsne` - we need to do this in order to make the Fit-SNE embedding the default embedding that will be retrieved in calls to functions such as `DimPlot()` or `FeaturePlot()`. 

```{r}
embed <- as.matrix(py$embed_multi)
rownames(embed) <- colnames(pbmc)
colnames(embed) <- c("Fit-SNE_1", "Fit-SNE_2")
pbmc@reductions$bh_tsne <- pbmc@reductions$tsne
pbmc@reductions$tsne <- CreateDimReducObject(embeddings = embed, 
                                             key = "FitSNE_", 
                                             assay = "SCT", 
                                             global = TRUE)
```

Visualizing the results shows pretty much the same global structure as with the default t-SNE implementation, albeit rotated a bit, but I like that Fit-SNE's clusters are a bit denser, so we'll use Fit-SNE going forward.

```{r}
p1 <- DimPlot(pbmc, cols = cols25()) + 
      labs(x = "Fit-SNE 1", y = "Fit-SNE 2") + 
      theme_yehlab() + 
      theme(legend.position = "bottom", 
            legend.justification = "center", 
            legend.direction = "horizontal") + 
      guides(color = guide_legend(nrow = 1, override.aes = list(size = 3)))
p1
```

# Reclustering

Here we'll examine clusters 0, 1, and 2. Cluster 0 seems large enough, and has enough variability on the X-axis of the t-SNE embedding to appear as though it might harbor subgroups. Clusters 1 and 2 have small but visible subclusters.

```{r, warning=FALSE, message=FALSE, results='hold'}
reclust_res <- ReclusterCells(pbmc, 
                              which.clust = list(0, 1, 2), 
                              n.variable.genes = 4000,
                              n.PC = 15, 
                              k.vals = c(5, 10, 15, 20, 25), 
                              resolution.vals = c(.1, .2, .3, .4), 
                              which.dim.reduc = "tsne", 
                              redo.embedding = TRUE, 
                              random.seed = 629)
```

Now we'll run Fit-SNE on each of the reclustered objects for consistencies sake. First we'll need to isolate the PC matrices and send them to Python. 

```{r}
pc_clust0 <- Embeddings(reclust_res[[1]], reduction = "pca")
pc_clust1 <- Embeddings(reclust_res[[2]], reduction = "pca")
pc_clust2 <- Embeddings(reclust_res[[3]], reduction = "pca")
```

Running Fit-SNE three times could probably be cleaned up and done in a for loop, but this was easiest to write out in a short time. We use different perplexity sets for each cluster, as they're all composed of differing numbers of cells.

```{python}
# import PC matrices
pc_clust0 = np.array(r.pc_clust0)
pc_clust1 = np.array(r.pc_clust1)
pc_clust2 = np.array(r.pc_clust2)

# run Fit-SNE w/ multiscale kernel - cluster 0
affin_multi_clust0 = Multiscale(pc_clust0, perplexities=[15, 50], metric='cosine', random_state=629)
init_clust0 = initialization.pca(pc_clust0, random_state=629)
tsne_obj_clust0 = TSNEEmbedding(init_clust0, affin_multi_clust0, negative_gradient_method='fft')
embed_clust0 = tsne_obj_clust0.optimize(n_iter=300, exaggeration=12, momentum=0.7)
embed_multi_clust0 = embed_clust0.optimize(n_iter=850, exaggeration=1, momentum=0.8)

# run Fit-SNE w/ multiscale kernel - cluster 1
affin_multi_clust1 = Multiscale(pc_clust1, perplexities=[15, 30], metric='cosine', random_state=629)
init_clust1 = initialization.pca(pc_clust1, random_state=629)
tsne_obj_clust1 = TSNEEmbedding(init_clust1, affin_multi_clust1, negative_gradient_method='fft')
embed_clust1 = tsne_obj_clust1.optimize(n_iter=300, exaggeration=12, momentum=0.6)
embed_multi_clust1 = embed_clust1.optimize(n_iter=850, exaggeration=1, momentum=0.8)

# run Fit-SNE w/ multiscale kernel - cluster 2
affin_multi_clust2 = Multiscale(pc_clust2, perplexities=[10, 30], metric='cosine', random_state=629)
init_clust2 = initialization.pca(pc_clust2, random_state=629)
tsne_obj_clust2 = TSNEEmbedding(init_clust2, affin_multi_clust2, negative_gradient_method='fft')
embed_clust2 = tsne_obj_clust2.optimize(n_iter=300, exaggeration=12, momentum=0.6)
embed_multi_clust2 = embed_clust2.optimize(n_iter=750, exaggeration=1, momentum=0.8)
```

Now we bring the results back in to R.

```{r}
embed0 <- as.matrix(py$embed_multi_clust0)
rownames(embed0) <- colnames(reclust_res[[1]])
colnames(embed0) <- c("Fit-SNE_1", "Fit-SNE_2")
reclust_res[[1]]@reductions$bh_tsne <- reclust_res[[1]]@reductions$tsne
reclust_res[[1]]@reductions$fitsne <- CreateDimReducObject(embeddings = embed0, 
                                                           key = "FitSNE_",
                                                           assay = "SCT", 
                                                           global = TRUE)
embed1 <- as.matrix(py$embed_multi_clust1)
rownames(embed1) <- colnames(reclust_res[[2]])
colnames(embed1) <- c("Fit-SNE_1", "Fit-SNE_2")
reclust_res[[2]]@reductions$bh_tsne <- reclust_res[[2]]@reductions$tsne
reclust_res[[2]]@reductions$fitsne <- CreateDimReducObject(embeddings = embed1,
                                                           key = "FitSNE_",
                                                           assay = "SCT",
                                                           global = TRUE)
embed2 <- as.matrix(py$embed_multi_clust2)
rownames(embed2) <- colnames(reclust_res[[3]])
colnames(embed2) <- c("Fit-SNE_1", "Fit-SNE_2")
reclust_res[[3]]@reductions$bh_tsne <- reclust_res[[3]]@reductions$tsne
reclust_res[[3]]@reductions$fitsne <- CreateDimReducObject(embeddings = embed2,
                                                           key = "FitSNE_",
                                                           assay = "SCT",
                                                           global = TRUE)
```

Here's what our reclusterings look like. There's clear visual separation between the main clusters and the subgroups we've discovered using `SCISSORS`.

```{r}
p2 <- DimPlot(reclust_res[[1]], cols = paletteer_d("ggsci::default_locuszoom")) + 
      labs(x = "Fit-SNE 1", y = "Fit-SNE 2") + 
      theme_yehlab() + 
      theme(legend.position = "bottom", 
            legend.justification = "center", 
            legend.direction = "horizontal") + 
      guides(color = guide_legend(nrow = 1, override.aes = list(size = 3)))
p3 <- DimPlot(reclust_res[[2]], cols = paletteer_d("ggsci::default_locuszoom")) + 
      labs(x = "Fit-SNE 1", y = "Fit-SNE 2") + 
      theme_yehlab() + 
      theme(legend.position = "bottom", 
            legend.justification = "center", 
            legend.direction = "horizontal") + 
      guides(color = guide_legend(nrow = 1, override.aes = list(size = 3)))
p4 <- DimPlot(reclust_res[[3]], cols = paletteer_d("ggsci::default_locuszoom")) + 
      labs(x = "Fit-SNE 1", y = "Fit-SNE 2") + 
      theme_yehlab() + 
      theme(legend.position = "bottom", 
            legend.justification = "center", 
            legend.direction = "horizontal") + 
      guides(color = guide_legend(nrow = 1, override.aes = list(size = 3)))
p2
p3
p4
```

Next, we'll reintegrate our new clusters into our original `Seurat` object - this requires some finagling as `Seurat` is a bit weird with how it stores cell-level metadata. Since we had six clusters originally, and we discovered six new subclusters, we'll end up with twelve total clusters.

```{r, message=FALSE}
clust_df <- data.frame(CellID = colnames(pbmc), 
                       ClustID = as.numeric(pbmc@meta.data$seurat_clusters) - 1, 
                       stringsAsFactors = FALSE)
clust_6 <- rownames(reclust_res[[1]]@meta.data[reclust_res[[1]]@meta.data$seurat_clusters == 1, ])
clust_7 <- rownames(reclust_res[[1]]@meta.data[reclust_res[[1]]@meta.data$seurat_clusters == 2, ])
clust_8 <- rownames(reclust_res[[1]]@meta.data[reclust_res[[1]]@meta.data$seurat_clusters == 3, ])
clust_9 <- rownames(reclust_res[[2]]@meta.data[reclust_res[[2]]@meta.data$seurat_clusters == 1, ])
clust_10 <- rownames(reclust_res[[3]]@meta.data[reclust_res[[3]]@meta.data$seurat_clusters == 1, ])
clust_11 <- rownames(reclust_res[[3]]@meta.data[reclust_res[[3]]@meta.data$seurat_clusters == 2, ])
label <- case_when(clust_df$CellID %in% clust_6 ~ 6, 
                   clust_df$CellID %in% clust_7 ~ 7, 
                   clust_df$CellID %in% clust_8 ~ 8, 
                   clust_df$CellID %in% clust_9 ~ 9, 
                   clust_df$CellID %in% clust_10 ~ 10, 
                   clust_df$CellID %in% clust_11 ~ 11, 
                   TRUE ~ clust_df$ClustID)
pbmc <- AddMetaData(pbmc, 
                    col.name = "seurat_clusters", 
                    metadata = as.factor(label))
Idents(pbmc) <- "seurat_clusters"
p5 <- DimPlot(pbmc, cols = cols25()) + 
      labs(x = "Fit-SNE 1", y = "Fit-SNE 2") + 
      theme_yehlab() + 
      theme(legend.position = "bottom", 
            legend.justification = "center", 
            legend.direction = "horizontal") + 
      guides(color = guide_legend(nrow = 1, override.aes = list(size = 3)))
p5
```

# Identify Cell Types

Now that we've determined our subpopulations, we can assign cell types to each cluster using the marker genes provided in [the Satija Lab's PBMC3k vignette](https://satijalab.org/seurat/v3.2/pbmc3k_tutorial.html), as well as other canonical markers.

## CD4+ T Cells

We can quickly identify cluster 0 as the memory CD4+ T cells, and cluster 6 as the naive CD4+ population.

```{r}
p6 <- FeaturePlot(pbmc, 
                  features = "IL7R", 
                  cols = c("lightgrey", "firebrick3")) + 
      NoLegend() + 
      theme_yehlab() + 
      theme(axis.title = element_blank())
p7 <- FeaturePlot(pbmc, 
                  features = "CCR7", 
                  cols = c("lightgrey", "firebrick3")) + 
      NoLegend() + 
      theme_yehlab()  + 
      theme(axis.title = element_blank())
p8 <- FeaturePlot(pbmc, 
                  features = "S100A4", 
                  cols = c("lightgrey", "firebrick3")) + 
      NoLegend() + 
      theme_yehlab() + 
      theme(axis.title = element_blank())
p9 <- (p6 | p7 | p8) / p5
p9
```

Cluster 7 is only subtly separated from the CD4+ T cell clusters. We'll use differential expression testing to determine if the cells in cluster 7 are a spurious cluster or a real T cell subtype. After running a Wilcox test we can see that several of the differentially expressed are associated with the interferon family of cytokines and with anti-viral immune responses.

```{r}
clust7_markers <- FindAllMarkers(pbmc, 
                                 assay = "SCT", 
                                 logfc.threshold = .5, 
                                 only.pos = TRUE, 
                                 test.use = "wilcox", 
                                 verbose = FALSE, 
                                 random.seed = 629) %>% 
                  subset(cluster  == 7 & p_val_adj < .05) 
clust7_markers
```

Upon visual inspection of the top three marker genes (as determined by adjusted p-value), we can see that they do an equally good job of distinguishing the small cluster from the sample as a whole as they do at separating it from the memory CD4+ T cells. Due to their anti-viral characteristics , we'll define this group as being composed of Type 1 helper T cells (Th1).

```{r}
p10 <- FeaturePlot(pbmc, 
                   features = "IFIT1", 
                   cols = c("lightgrey", "firebrick3")) + 
       NoLegend() + 
       theme_yehlab() + 
       theme(axis.title = element_blank())
p11 <- FeaturePlot(pbmc, 
                  features = "IFIT3", 
                  cols = c("lightgrey", "firebrick3")) + 
       NoLegend() + 
       theme_yehlab() + 
       theme(axis.title = element_blank())
p12 <- FeaturePlot(pbmc, 
                  features = "IFI6", 
                  cols = c("lightgrey", "firebrick3")) + 
       NoLegend() + 
       theme_yehlab() + 
       theme(axis.title = element_blank())
p13 <- (p10 | p11 | p12) / p5
p13
```

## CD14+ Monocytes

Cluster 1 clearly houses our CD14+ monocytes.

```{r}
p14 <- FeaturePlot(pbmc, 
                   features = "CD14", 
                  cols = c("lightgrey", "firebrick3")) + 
       NoLegend() + 
       theme_yehlab() + 
       theme(axis.title = element_blank())
p15 <- FeaturePlot(pbmc, 
                   features = "LYZ", 
                  cols = c("lightgrey", "firebrick3")) + 
       NoLegend() + 
       theme_yehlab() + 
       theme(axis.title = element_blank())
p16 <- (p14 | p15) / p5
p16
```

## FCGR3A+ Monocytes

It follows that the FCGR3A+ monocytes are positioned next to the CD14+ monocytes in cluster 4.

```{r}
p17 <- FeaturePlot(pbmc, 
                   features = "FCGR3A", 
                  cols = c("lightgrey", "firebrick3")) + 
       NoLegend() + 
       theme_yehlab() + 
       theme(axis.title = element_blank())
p18 <- FeaturePlot(pbmc, 
                   features = "MS4A7", 
                  cols = c("lightgrey", "firebrick3")) + 
       NoLegend() + 
       theme_yehlab() + 
       theme(axis.title = element_blank())
p19 <- (p17 | p18) / p5
p19
```

## B Cells

Expression of MS4A1 allows us to isolate the B cells as belonging to cluster 3.

```{r}
p20 <- FeaturePlot(pbmc, 
                   features = "MS4A1", 
                  cols = c("lightgrey", "firebrick3")) + 
       NoLegend() + 
       theme_yehlab() + 
       theme(axis.title = element_blank())
p21 <- p20 / p5
p21
```

## CD8+ T Cells

The canonical marker CD8A swiftly identifies our CD8+ T cells in cluster 2.

```{r}
p22 <- FeaturePlot(pbmc, 
                   features = "CD8A", 
                  cols = c("lightgrey", "firebrick3")) + 
       NoLegend() + 
       theme_yehlab() + 
       theme(axis.title = element_blank())
p23 <- p22 / p5
p23
```

## Natural Killer Cells

We can use NKG7 and GNLY to isolate the NK cells in cluster 5.

```{r}
p24 <- FeaturePlot(pbmc, 
                   features = "NKG7", 
                  cols = c("lightgrey", "firebrick3")) + 
       NoLegend() + 
       theme_yehlab() + 
       theme(axis.title = element_blank())
p25 <- FeaturePlot(pbmc, 
                   features = "GNLY", 
                  cols = c("lightgrey", "firebrick3")) + 
       NoLegend() + 
       theme_yehlab() + 
       theme(axis.title = element_blank())
p26 <- (p24 | p25) / p5
p26
```

## Dendritic Cells

The dendritic cell group is defined by expression of FCER1A and CST3 in cluster 9.

```{r}
p27 <- FeaturePlot(pbmc, 
                   features = "FCER1A", 
                  cols = c("lightgrey", "firebrick3")) + 
       NoLegend() + 
       theme_yehlab() + 
       theme(axis.title = element_blank())
p28 <- FeaturePlot(pbmc, 
                   features = "CST3", 
                  cols = c("lightgrey", "firebrick3")) + 
       NoLegend() + 
       theme_yehlab() + 
       theme(axis.title = element_blank())
p29 <- (p27 | p28) / p5
p29
```

## Platelets

The tiny platelet population of `r sum(pbmc$seurat_clusters == 10)` cells can be identified by its expression of PPBP in cluster 10.

```{r}
p30 <- FeaturePlot(pbmc, 
                   features = "PPBP", 
                  cols = c("lightgrey", "firebrick3")) + 
       NoLegend() + 
       theme_yehlab() + 
       theme(axis.title = element_blank())
p31 <- p30 / p5
p31
```

## Plasmacytoid Dendritic Cells

A very small population of plasmacytoid dendritic cells - only `r sum(pbmc$seurat_clusters == 11)` cells - was teased out by `SCISSORS` and can be annotated using expression of MZB1.

```{r}
p32 <- FeaturePlot(pbmc, 
                   features = "MZB1", 
                   cols = c("lightgrey", "firebrick3")) + 
       NoLegend() + 
       theme_yehlab() + 
       theme(axis.title = element_blank())
p33 <- p32 / p5
p33
```

## Hematocytoblasts

Lastly, we need to assign an identity to the unknown cluster 8 - another rare cell population, also composed of only `r sum(pbmc$seurat_clusters == 8)` cells. We'll use differential expression to compare it as we did earlier with the Th1 cells. The marker CYTL1 is over-expressed in hematopoetic stem cells (hematocytoblasts).

```{r}
clust8_markers <- FindAllMarkers(pbmc, 
                                 test.use = "wilcox", 
                                 only.pos = TRUE, 
                                 logfc.threshold = .75, 
                                 assay = "SCT", 
                                 verbose = FALSE, 
                                 random.seed = 629) %>% 
                  subset(cluster == 8 & p_val_adj < .05)
clust8_markers
```

```{r}
p34 <- FeaturePlot(pbmc, 
                   features = "CYTL1", 
                   cols = c("lightgrey", "firebrick3")) + 
       NoLegend() + 
       theme_yehlab() + 
       theme(axis.title = element_blank())
p35 <- p34 / p5
p35
```

## Final Figure

Finally, we'll add cell labels to our original `Seurat` object and plot the results.

```{r}
pbmc@meta.data$label <- case_when(pbmc@meta.data$seurat_clusters == 0 ~ "Memory CD4+ T", 
                                  pbmc@meta.data$seurat_clusters == 1 ~ "CD14+ Monocyte", 
                                  pbmc@meta.data$seurat_clusters == 2 ~ "CD8+ T", 
                                  pbmc@meta.data$seurat_clusters == 3 ~ "B", 
                                  pbmc@meta.data$seurat_clusters == 4 ~ "FCGR3A+ Monocyte", 
                                  pbmc@meta.data$seurat_clusters == 5 ~ "NK", 
                                  pbmc@meta.data$seurat_clusters == 6 ~ "Naive CD4+ T", 
                                  pbmc@meta.data$seurat_clusters == 7 ~ "Th1", 
                                  pbmc@meta.data$seurat_clusters == 8 ~ "Hematocytoblast", 
                                  pbmc@meta.data$seurat_clusters == 9 ~ "DC", 
                                  pbmc@meta.data$seurat_clusters == 10 ~ "Platelet", 
                                  pbmc@meta.data$seurat_clusters == 11 ~ "Plasmacytoid DC")
```

We redefine the final color palette to be used in order to have a better-looking figure. Note: if you're manually defining colors and want `Seurat` to assign a certain color to a certain cell type or cluster number, you need to name the vector of colors you provide. 

```{r}
final_pal <- c("paleturquoise4", "goldenrod", "steelblue1", "lightseagreen", 
               "sienna4", "mediumblue", "coral3", "magenta", "limegreen", 
               "forestgreen", "darkorchid3", "orange")
names(final_pal) <- c("Memory CD4+ T", "CD14+ Monocyte", "CD8+ T", "B", 
                      "FCGR3A+ Monocyte", "NK", "Naive CD4+ T", "Th1", 
                      "Hematocytoblast", "DC", "Platelet", "Plasmacytoid DC")
p36 <- DimPlot(pbmc, cols = final_pal, group.by = "label") + 
       theme_yehlab() + 
       theme(legend.position = "bottom", 
             legend.justification = "center", 
             legend.text = element_text(size = 11)) + 
       guides(color = guide_legend(nrow = 3, override.aes = list(size = 3))) + 
       labs(x = "Fit-SNE 1", y = "Fit-SNE 2", title = NULL)
p36
```

# Conclusions

SCISSORS revealed tiny platelet and plasmacytoid dendritic cell clusters that were initially grouped with the CD8+ T cells, and it helped us to separate the dendritic cells from the larger CD14+ monocyte cluster. It also split up the naive and memory CD4+ T cells, and showed us a tiny Th1 cell subset that was not initially visible. The plasmacytoid DCs and Th1 cells were not annotated in [the original Satija Lab PBMC3k vignette](https://satijalab.org/seurat/v3.2/pbmc3k_tutorial.html). 

We used the PBMC3k dataset because of 1) its immediate availability to anyone wishing to replicate our results and 2) the validity of its annotations, which allowed us to be confident in the results from SCISSORS, which was able to carve out rare cell groups from larger, broader cell types. In this case, the dendritic cell cluster was composed of 31 cells, the platelet cluster of 11 cells, the Th1 cluster of 25 cells, and the minuscule plasmacytoid DC cluster of just 4 cells. Respectively, these cell types made up 1.15%, 0.41%, 0.93%, and 0.15% of the entire sample. We thus believe we can confidently say that SCISSORS has been shown to accurately and swiftly identify rare cell types by considering the variance in gene expression within clusters and judging iterative reclustering using silhouette scores, rather than attempting to identify rare cell populations at the level of the entire dataset.

# Save Data & Figures

This part isn't really worth reading; it's just here to prove that all the figures were actually dynamically generated and saved upon knitting this document.

```{r, eval=FALSE}
pdrive_dir <- "/Volumes/labs/Home/Jen Jen Yeh Lab/Jack/Reclustering Project/Seurat Objects/pbmc3k_202101.Rds"
saveRDS(pbmc, file = pdrive_dir)
```

We'll create a quick convenience function to help us save the figures.

```{r}
saveSCISSORS <- function(plot = NULL, 
                         name = NULL, 
                         border = TRUE, 
                         pub.ready = FALSE) {
  if (is.null(plot) | is.null(name)) stop("You forgot some arguments.")
  if (pub.ready) {
    dir <- "~/Desktop/R/SCISSORS/vignettes/figures_pub/PBMC"
    if (!border) {
      plot <- plot + 
              theme(panel.border = element_blank(), 
                    axis.title = element_blank(), 
                    legend.position = "none")
    } else {
      plot <- plot + 
              theme(axis.title = element_blank(), 
                    legend.position = "none")
    }
    ggsave(filename = paste0(name, ".pdf"), 
           device = "pdf", 
           units = "in",
           path = dir, 
           height = 8, 
           width = 8) 
  } else {
    dir <- "~/Desktop/R/SCISSORS/vignettes/figures_supp/PBMC"
    ggsave(filename = paste0(name, ".pdf"), 
           device = "pdf", 
           units = "in",
           path = dir, 
           height = 8, 
           width = 8) 
  }
}
```

Lastly, we'll save the figures under `./vignettes/figures/`. 

```{r}
saveSCISSORS(plot = p0, name = "Seurat_Clusters", 
             pub.ready = TRUE, border = FALSE)
saveSCISSORS(plot = p1, name = "Seurat_Clusters_FitSNE", 
             pub.ready = TRUE, border = FALSE)
saveSCISSORS(plot = p2, name = "Clust0_Reclust", 
             pub.ready = TRUE, border = FALSE)
saveSCISSORS(plot = p3, name = "Clust1_Reclust", 
             pub.ready = TRUE, border = FALSE)
saveSCISSORS(plot = p4, name = "Clust2_Reclust", 
             pub.ready = TRUE, border = FALSE)
saveSCISSORS(plot = p5, name = "SCISSORS_Clusters", 
             pub.ready = TRUE, border = FALSE)
saveSCISSORS(plot = p6, name = "CD4T_IL7R")
saveSCISSORS(plot = p7, name = "CD4T_CCR7")
saveSCISSORS(plot = p8, name = "CD4T_S100A4")
saveSCISSORS(plot = p10, name = "TH1_IFIT1")
saveSCISSORS(plot = p11, name = "TH1_IFIT3")
saveSCISSORS(plot = p12, name = "TH1_IFI6")
saveSCISSORS(plot = p14, name = "Monocyte_CD14")
saveSCISSORS(plot = p15, name = "Monocyte_LYZ")
saveSCISSORS(plot = p17, name = "FCGR3A_Monocyte_FCGR3A")
saveSCISSORS(plot = p18, name = "FCGR3A_Monocyte_MS4A7")
saveSCISSORS(plot = p20, name = "B_MS4A1")
saveSCISSORS(plot = p22, name = "CD8T_CD8A")
saveSCISSORS(plot = p24, name = "NK_NKG7")
saveSCISSORS(plot = p25, name = "NK_GNLY")
saveSCISSORS(plot = p27, name = "DC_FCER1A")
saveSCISSORS(plot = p28, name = "DC_CST3")
saveSCISSORS(plot = p30, name = "Platelet_PPBP")
saveSCISSORS(plot = p32, name = "Plasmacytoid_DC_MZB1")
saveSCISSORS(plot = p34, name = "Hematocytoblasts_CYTL1")
saveSCISSORS(plot = p36, name = "SCISSORS_Final_Annotations", 
             pub.ready = TRUE, border = FALSE)
```

And of course:

```{r}
sessionInfo()
```
